Implementing asynchronous collective operations in a multi-node processing system

ABSTRACT

A method, system, and computer program product are disclosed for implementing an asynchronous collective operation in a multi-node data processing system. In one embodiment, the method comprises sending data to a plurality of nodes in the data processing system, broadcasting a remote get to the plurality of nodes, and using this remote get to implement asynchronous collective operations on the data by the plurality of nodes. In one embodiment, each of the nodes performs only one task in the asynchronous operations, and each node sets up a base address table with an entry for a base address of a memory buffer associated with said each node. In another embodiment, each of the nodes performs a plurality of tasks in said collective operations, and each task of each node sets up a base address table with an entry for a base address of a memory buffer associated with the task.

CROSS-REFERENCE TO RELATED APPLICATIONS

This application claims the benefit of U.S. Patent Application Ser. Nos. 61/261,269, filed Nov. 13, 2009 for “LOCAL ROLLBACK FOR FAULT-TOLERANCE IN PARALLEL COMPUTING SYSTEMS”; 61/293,611, filed Jan. 8, 2010 for “A MULTI-PETASCALE HIGHLY EFFICIENT PARALLEL SUPERCOMPUTER”; and 61/295,669, filed Jan. 15, 2010 for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOKUP AND PARTIAL CACHE LINE SPECULATION SUPPORT”, the entire content and disclosure of each of which is incorporated herein by reference; and is related to the following commonly-owned, co-pending United States patent applications, the entire contents and disclosure of each of which is expressly incorporated by reference herein as if fully set forth herein: U.S. patent application Ser. No. 12/684,367, filed Jan. 8, 2010, for “USING DMA FOR COPYING PERFORMANCE COUNTER DATA TO MEMORY”; U.S. patent application Ser. No. 12/684,172, filed Jan. 8, 2010 for “HARDWARE SUPPORT FOR COLLECTING PERFORMANCE COUNTERS DIRECTLY TO MEMORY”; U.S. patent application Ser. No. 12/684,190, filed Jan. 8, 2010 for “HARDWARE ENABLED PERFORMANCE COUNTERS WITH SUPPORT FOR OPERATING SYSTEM CONTEXT SWITCHING”; U.S. patent application Ser. No. 12/684,496, filed Jan. 8, 2010 for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST RECONFIGURATION OF PERFORMANCE COUNTERS”; U.S. patent application Ser. No. 12/684,429, filed Jan. 8, 2010, for “HARDWARE SUPPORT FOR SOFTWARE CONTROLLED FAST MULTIPLEXING OF PERFORMANCE COUNTERS”; U.S. patent application Ser. No. ______ (YOR920090533US1 (24682)), for “CONDITIONAL LOAD AND STORE IN A SHARED CACHE”; U.S. patent application Ser. No. 12/684,738, filed Jan. 8, 2010, for “DISTRIBUTED PERFORMANCE COUNTERS”; U.S. patent application Ser. No. 12/684,860, filed Jan. 8, 2010, for “PAUSE PROCESSOR HARDWARE THREAD ON PIN”; U.S. patent application Ser. No. 12/684,174, filed Jan. 8, 2010, for “PRECAST THERMAL INTERFACE ADHESIVE FOR EASY AND REPEATED, SEPARATION AND REMATING”; U.S. patent application Ser. No. 12/684,184, filed Jan. 8, 2010, for “ZONE ROUTING IN A TORUS NETWORK”; U.S. patent application Ser. No. 12/684,852, filed Jan. 8, 2010, for “PROCESSOR RESUME UNIT”; U.S. patent application Ser. No. 12/684,642, filed Jan. 8, 2010, for “TLB EXCLUSION RANGE”; U.S. patent application Ser. No. 12/684,804, filed Jan. 8, 2010, for “DISTRIBUTED TRACE USING CENTRAL PERFORMANCE COUNTER MEMORY”; U.S. patent application Ser. No. 61/293,237, filed Jan. 8, 2010, for “ORDERING OF GUARDED AND UNGUARDED STORES FOR NO-SYNC I/O”; U.S. patent application Ser. No. 12/693,972, filed Jan. 26, 2010, for “DISTRIBUTED PARALLEL MESSAGING FOR MULTIPROCESSOR SYSTEMS”; U.S. patent application Ser. No. 12/688,747, filed Jan. 15, 2010, for “Support for non-locking parallel reception of packets belonging to the same reception FIFO”; U.S. patent application Ser. No. 12/688,773, filed Jan. 15, 2010, for “OPCODE COUNTING FOR PERFORMANCE MEASUREMENT”; U.S. patent application Ser. No. 12/684,776, filed Jan. 8, 2010, for “MULTI-INPUT AND BINARY REPRODUCIBLE, HIGH BANDWIDTH FLOATING POINT ADDER IN A COLLECTIVE NETWORK”; U.S. patent application Ser. No. ______ (YOR920090581US1 (24732)), for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOK UP AND PARTIAL CACHE LINE SPECULATION SUPPORT”; U.S. patent application Ser. No. ______ (YOR920090582US1 (24733)), for “MEMORY SPECULATION IN A MULTI LEVEL CACHE SYSTEM”; U.S. patent application Ser. No. ______ (YOR920090583US1 (24738)), for “SPECULATION AND TRANSACTION IN A SYSTEM SPECULATION AND TRANSACTION SUPPORT IN L2 L1 SUPPORT FOR SPECULATION/TRANSACTIONS IN A2 PHYSICAL ALIASING FOR THREAD LEVEL SPECULATION MULTIFUNCTIONING L2 CACHE CACHING MOST RECENT DIRECTORY LOOK UP AND PARTIAL CACHE LINE SPECULATION SUPPORT”; U.S. patent application Ser. No. ______ (YOR920090584US1 (24739)), for “MINIMAL FIRST LEVEL CACHE SUPPORT FOR MEMORY SPECULATION MANAGED BY LOWER LEVEL CACHE”; U.S. patent application Ser. No. ______ (YOR920090585US1 (24740)), for “PHYSICAL ADDRESS ALIASING TO SUPPORT MULTI-VERSIONING IN A SPECULATION-UNAWARE CACHE”; U.S. patent application Ser. No. 61/293,552, filed Jan. 8, 2010, for “LIST BASED PREFETCH”; U.S. patent application Ser. No. 12/684,693, filed Jan. 8, 2010, for “PROGRAMMABLE STREAM PREFETCH WITH RESOURCE OPTIMIZATION”; U.S. patent application Ser. No. 61/293,494, filed Jan. 8, 2010, for “NON-VOLATILE MEMORY FOR CHECKPOINT STORAGE”; U.S. patent application Ser. No. 61/293,476, filed Jan. 8, 2010, for “NETWORK SUPPORT FOR SYSTEM INITIATED CHECKPOINTS”; U.S. patent application Ser. No. 61/293,554, filed Jan. 8, 2010, for “TWO DIFFERENT PREFETCHING COMPLEMENTARY ENGINES OPERATING SIMULTANEOUSLY”; U.S. patent application Ser. No. ______ (YOR920090598US1 (24761)), for “DEADLOCK-FREE CLASS ROUTES FOR COLLECTIVE COMMUNICATIONS EMBEDDED IN A MULTI-DIMENSIONAL TORUS NETWORK”; U.S. patent application Ser. No. 61/293,559, filed Jan. 8, 2010, for “IMPROVING RELIABILITY AND PERFORMANCE OF A SYSTEM-ON-A-CHIP BY PREDICTIVE WEAR-OUT BASED ACTIVATION OF FUNCTIONAL COMPONENTS”; U.S. patent application Ser. No. 61/293,569, filed Jan. 8, 2010, for “IMPROVING THE EFFICIENCY OF STATIC CORE TURNOFF IN A SYSTEM-ON-A-CHIP WITH VARIATION”; U.S. patent application Ser. No. ______ (YOR920090586US1 (24861)), for “MULTIFUNCTIONING CACHE”; U.S. patent application Ser. No. ______ (YOR920090645US1 (24873)) for “I/O ROUTING IN A MULTIDIMENSIONAL TORUS NETWORK”; U.S. patent application Ser. No. 12/684,287, filed Jan. 8, 2010 for “ARBITRATION IN CROSSBAR INTERCONNECT FOR LOW LATENCY”; U.S. patent application Ser. No. 12/684,630, filed Jan. 8, 2010 for “EAGER PROTOCOL ON A CACHE PIPELINE DATAFLOW”; U.S. patent application Ser. No. ______ (YOR920090648US1 (24876)) for “EMBEDDING GLOBAL BARRIER AND COLLECTIVE IN A TORUS NETWORK”; U.S. patent application Ser. No. 61/293,499, filed Jan. 8, 2010 for “GLOBAL SYNCHRONIZATION OF PARALLEL PROCESSORS USING CLOCK PULSE WIDTH MODULATION”; U.S. patent application Ser. No. 61/293,266, filed Jan. 8, 2010 for “IMPLEMENTATION OF MSYNC”; U.S. patent application Ser. No. ______ (YOR920090651US1 (24879)) for “NON-STANDARD FLAVORS OF MSYNC”; U.S. patent application Ser. No. ______ (YOR920090652US1 (24881)) for “HEAP/STACK GUARD PAGES USING A WAKEUP UNIT”; U.S. patent application Ser. No. 61/293,603, filed Jan. 8, 2010 for “MECHANISM OF SUPPORTING SUB-COMMUNICATOR COLLECTIVES WITH O(64) COUNTERS AS OPPOSED TO ONE COUNTER FOR EACH SUB-COMMUNICATOR”; and U.S. patent application Ser. No. ______ (YOR920100001US1 (24883)) for “REPRODUCIBILITY IN A MULTIPROCESSOR SYSTEM”.

GOVERNMENT CONTRACT

This invention was made with government support under Contract No. B554331 awarded by the Department of Energy. The Government has certain rights in this invention.

BACKGROUND OF THE INVENTION

1. Field of the Invention

The present invention, generally, relates to collective operations in data processing systems, and more specifically, to asynchronous collective operations in multi-node data processing systems.

2. Background Art

Parallel computer applications often use message passing to communicate between processors. Message passing utilities such as the Message Passing Interface (MPI) support two types of communication: point-to-point and collective. In point-to-point messaging, a processor sends a message to another processor that is ready to receive it. In a collective communication operation, however, many processors participate together in the communication operation.

Collective communication operations play a very important role in high performance computing. In collective communication, data are redistributed cooperatively among a group of processes. Sometimes the redistribution is accompanied by various types of computation on the data, and it is the results of the computation that are redistributed. MPI, which is the de facto message passing programming model standard, defines a set of collective communication interfaces, including MPI_BARRIER, MPI_BCAST, MPI_REDUCE, MPI_ALLREDUCE, MPI_ALLGATHER, MPI_ALLTOALL, etc. These are application level interfaces and are more generally referred to as APIs. In MPI, collective communications are carried out on communicators, which define the participating processes and a unique communication context.
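
By way of a non-limiting illustration of the API level (the code below is not part of this disclosure, and the buffer names are arbitrary), a global sum over all processes in a communicator can be expressed with one of the collective interfaces listed above:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        MPI_Init(&argc, &argv);

        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process contributes one value; MPI_Allreduce leaves the
         * global sum on every process in the communicator. */
        double contribution = rank + 1.0;
        double global_sum = 0.0;
        MPI_Allreduce(&contribution, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                      MPI_COMM_WORLD);

        printf("rank %d: global sum = %f\n", rank, global_sum);
        MPI_Finalize();
        return 0;
    }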

Functionally, each collective communication is equivalent to a sequence of point-to-point communications, for which MPI defines MPI_SEND, MPI_RECV and MPI_WAIT interfaces (and variants). MPI collective communication operations are implemented with a layered approach in which the collective communication routines handle semantic requirements and translate the collective communication function call into a sequence of SEND/RECV/WAIT operations according to the algorithms used. The point-to-point communication protocol layer guarantees reliable communication.
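
As a non-limiting sketch of this layering, the broadcast below is written as a plain sequence of point-to-point sends and receives; a linear schedule is used only for brevity, whereas practical implementations use tree schedules and the nonblocking/WAIT variants:

    #include <mpi.h>

    /* Naive broadcast built from point-to-point calls: the root sends the
     * buffer to every other rank, and non-roots post a matching receive.
     * This mirrors the layered approach described above, not an optimized
     * algorithm. */
    static void linear_bcast(void *buf, int count, MPI_Datatype type,
                             int root, MPI_Comm comm)
    {
        int rank, size;
        MPI_Comm_rank(comm, &rank);
        MPI_Comm_size(comm, &size);

        if (rank == root) {
            for (int dest = 0; dest < size; dest++)
                if (dest != root)
                    MPI_Send(buf, count, type, dest, /*tag=*/0, comm);
        } else {
            MPI_Recv(buf, count, type, root, /*tag=*/0, comm,
                     MPI_STATUS_IGNORE);
        }
    }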

Collective communication operations can be synchronous or asynchronous. In a synchronous collective operation, all processors have to reach the collective before any data movement happens on the network. For example, all processors need to make the collective API or function call before any data movement happens on the network. Synchronous collectives also ensure that all processors are participating in one or more collective operations that can be determined locally. In an asynchronous collective operation, there are no such restrictions, and processors can start sending data as soon as they reach the collective operation. With asynchronous collective operations, several collectives can be in progress simultaneously.

Asynchronous one-sided collectives that do not involve participation of the intermediate and destination processors are critical for achieving good performance in a number of programming paradigms. For example, in an asynchronous one-sided broadcast, the root initiates the broadcast and all destination processors receive the broadcast message without any intermediate nodes forwarding the broadcast message to other nodes.

BRIEF SUMMARY

Embodiments of the invention provide a method, system, and computer program product for implementing an asynchronous collective operation in a multi-node data processing system. In one embodiment, the method comprises sending data to a plurality of nodes in the data processing system, broadcasting a remote get to said plurality of nodes, and using said remote get to implement asynchronous collective operations on said data by the plurality of nodes.

In an embodiment, each of said plurality of nodes sets up a base address table with an entry for the source memory buffer associated with said each node's contribution to the global sum.

In an embodiment, each of said plurality of nodes sets up a base address table with an entry for the address of the destination memory buffer on the target to which the summed output is copied.

In an embodiment, all the nodes have the same physical address for the source buffer, and this address is placed once in the remote get descriptor payload (this is an optimization that reduces the size of the base address table by eliminating injection source buffers from it).

In an embodiment, each compute node has N processes, and the root node initiates N remote get operations (one for each process) and then does a local sum to complete the one-sided reduce.

In an embodiment, the invention provides a mechanism for a Messaging Unit to initiate a one-sided allreduce operation by injecting a remote get descriptor that is broadcast to all the compute nodes and initiates a global-sum operation back to the initiating node or an arbitrary target node.

In one embodiment, each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each node. In an embodiment, said broadcasting includes sending a defined communication to said plurality of destination nodes, and injecting a remote get descriptor into said defined communication. In an embodiment, said remote get descriptor includes a put that reduces data back to a root node during said asynchronous collective operations. In an embodiment, each of said plurality of nodes performs only one task in the asynchronous collective operation.

In one embodiment, each of the plurality of destination nodes performs a plurality of tasks in the collective operations, and each of the tasks of each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each task. In an embodiment, said broadcasting includes sending a communication to said plurality of destination nodes, and injecting a remote get descriptor into said communication. In an embodiment, said descriptor includes a put that reduces data back to a root node. In one embodiment, said put reduces data back to the root node from a first of the tasks performed by each of the nodes during the collective operation, and said broadcasting includes sending a put to each of the tasks of each of the nodes to reduce data from said each task to the root node.

In one embodiment, the invention uses the remote get collective to implement one-sided operations. The compute node kernel (CNK) operating system allows each MPI task to map the virtual to physical addresses of all the other tasks in the booted partition. Moreover, the remote-get and direct put descriptors take physical addresses of the input buffers.

BRIEF DESCRIPTION OF THE SEVERAL VIEWS OF THE DRAWINGS

FIG. 1 is a block diagram of a data processing system in accordance with an embodiment of the invention.

FIG. 2 shows a messaging unit implemented by the node chip of FIG. 1.

FIG. 3 illustrates a set of components used to implement collective communications in a multi-node processing system.

FIG. 4 illustrates a procedure, in accordance with an embodiment of the invention, for a one-sided asynchronous reduce operation when there is only one task per node.

FIG. 5 shows a procedure, in accordance with another embodiment of the invention, for a one-sided asynchronous reduce operation when there is more than one task per node.

DETAILED DESCRIPTION

As will be appreciated by one skilled in the art, the present invention may be embodied as a system, method or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment (including firmware, resident software, micro-code, etc.) or an embodiment combining software and hardware aspects that may all generally be referred to herein as a “circuit,” “module” or “system.” Furthermore, the present invention may take the form of a computer program product embodied in any tangible medium of expression having computer usable program code embodied in the medium.

Any combination of one or more computer usable or computer readable medium(s) may be utilized. The computer-usable or computer-readable medium may be, for example but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. More specific examples (a non-exhaustive list) of the computer-readable medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or Flash memory), an optical fiber, a portable compact disc read-only memory (CDROM), an optical storage device, a transmission media such as those supporting the Internet or an intranet, or a magnetic storage device. Note that the computer-usable or computer-readable medium could even be paper or another suitable medium upon which the program is printed, as the program can be electronically captured, via, for instance, optical scanning of the paper or other medium, then compiled, interpreted, or otherwise processed in a suitable manner, if necessary, and then stored in a computer memory. In the context of this document, a computer-usable or computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport the program for use by or in connection with the instruction execution system, apparatus, or device. The computer-usable medium may include a propagated data signal with the computer-usable program code embodied therewith, either in baseband or as part of a carrier wave. The computer usable program code may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc.

Computer program code for carrying out operations of the present invention may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, Smalltalk, C++ or the like, and conventional procedural programming languages, such as the “C” programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. In the latter scenario, the remote computer may be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet Service Provider).

The present invention is described below with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer program instructions may also be stored in a computer-readable medium that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable medium produce an article of manufacture including instruction means which implement the function/act specified in the flowchart and/or block diagram block or blocks.

The computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process, such that the instructions which execute on the computer or other programmable apparatus provide processes for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks.

Referring now to FIG. 1, there is shown the overall architecture of the multiprocessor computing node 50 implemented in a parallel computing system in which the present invention is implemented. In one embodiment, the multiprocessor system implements the proven Blue Gene® architecture, and is implemented in a Blue Gene/Q massively parallel computing system comprising, for example, 1024 compute node ASICs (BCQ), each including multiple processor cores.

The compute node 50 is a single chip (“nodechip”) based on low power A2 PowerPC cores, though the architecture can use any low power cores, and may comprise one or more semiconductor chips. In the embodiment depicted, the node includes 16 PowerPC A2 cores running at 1600 MHz.

More particularly, the basic nodechip 50 of the massively parallel supercomputer architecture illustrated in FIG. 1 includes (sixteen or seventeen) 16+1 symmetric multiprocessing (SMP) cores 52, each core being 4-way hardware threaded, supporting transactional memory and thread level speculation, and including a Quad Floating Point Unit (FPU) 53 on each core (204.8 GF peak node). In one implementation, the core operating frequency target is 1.6 GHz, providing, for example, a 563 GB/s bisection bandwidth to shared L2 cache 70 via a full crossbar switch 60. In one embodiment, there is provided 32 MB of shared L2 cache 70, each core having an associated 2 MB of L2 cache 72. There is further provided external DDR SDRAM (e.g., Double Data Rate synchronous dynamic random access) memory 80, as a lower level in the memory hierarchy in communication with the L2. In one embodiment, the node includes 42.6 GB/s DDR3 bandwidth (1.333 GHz DDR3) (2 channels, each with chip kill protection).

Each FPU 53 associated with a core 52 has a 32 B wide data path to the L1-cache 55 of the A2, allowing it to load or store 32 B per cycle from or into the L1-cache 55. Each core 52 is directly connected to a private prefetch unit (level-1 prefetch, L1P) 58, which accepts, decodes and dispatches all requests sent out by the A2. The store interface from the A2 core 52 to the L1P 58 is 32 B wide and the load interface is 16 B wide, both operating at processor frequency. The L1P 58 implements a fully associative, 32-entry prefetch buffer. Each entry can hold an L2 line of 128 B size. The L1P 58 provides two prefetching schemes: a sequential prefetcher as used in previous Blue Gene architecture generations, as well as a list prefetcher.

As shown in FIG. 1, the 32 MiB shared L2 is sliced into 16 units, each connecting to a slave port of the switch 60. Every physical address is mapped to one slice using a selection of programmable address bits or an XOR-based hash across all address bits. The L2-cache slices, the L1Ps and the L1-D caches of the A2s are hardware-coherent. A group of 4 slices is connected via a ring to one of the two DDR3 SDRAM controllers 78.
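
For illustration only, a slice selection of the XOR-based kind described above might fold the physical address bits into a 4-bit slice index as in the following sketch; the exact bit selection used by the hardware is programmable and is not specified here:

    #include <stdint.h>

    /* Illustrative sketch only: selects one of the 16 L2 slices by XOR-folding
     * the physical address, so that every address bit influences the 4-bit
     * slice index. */
    static unsigned l2_slice_for_address(uint64_t phys_addr)
    {
        unsigned hash = 0;
        for (int shift = 0; shift < 64; shift += 4)
            hash ^= (unsigned)((phys_addr >> shift) & 0xF);
        return hash;          /* slice index in the range 0..15 */
    }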

By implementing a direct memory access engine, referred to herein as a Messaging Unit (“MU”) such as MU 100, with each MU including a DMA engine and Network Card interface in communication with the XBAR switch, chip I/O functionality is provided. In one embodiment, the compute node further includes, in a non-limiting example: 10 intra-rack interprocessor links 90, each at 2.0 GB/s, for example, i.e., 10*2 GB/s intra-rack & inter-rack (e.g., configurable as a 5-D torus in one embodiment); and one I/O link 92, interfaced with the MU at 2.0 GB/s (2 GB/s I/O link to the I/O subsystem), is additionally provided. The system node employs or is associated and interfaced with an 8-16 GB memory/node. The ASIC may consume up to about 30 watts of chip power.

Although not shown, each A2 core has associated a quad-wide fused multiply-add SIMD floating point unit, producing 8 double precision operations per cycle, for a total of 128 floating point operations per cycle per compute chip. A2 is a 4-way multi-threaded 64-bit PowerPC implementation. Each A2 core has its own execution unit (XU), instruction unit (IU), and quad floating point unit (QPU) connected via the AXU (Auxiliary eXecution Unit) (FIG. 1). The QPU is an implementation of the 4-way SIMD QPX floating point instruction set architecture. QPX is an extension of the scalar PowerPC floating point architecture. It defines 32 32 B-wide floating point registers per thread instead of the traditional 32 scalar 8 B-wide floating point registers.

As mentioned above, the compute nodechip implements a direct memory access engine referred to herein as a Messaging Unit, “MU.” One embodiment of an MU is shown in FIG. 2 at 100. MU 100 transfers blocks via three switch master ports between the L2-caches 70 (FIG. 1) and the reception FIFOs 190 and transmission FIFOs 180 of the network interface 150. It is controlled by the cores via memory mapped I/O access through an additional switch slave port.

In one embodiment, one function of the messaging unit 100 is to ensure optimal data movement to, and from, the network into the local memory system. It supports injection and reception of messages, as well as data prefetching into the memory, and on-chip memory copy. On the injection side, the MU splits and packages messages into network packets, and sends packets to the network respecting the network protocol. On packet injection, the messaging unit distinguishes between packet injection and memory prefetching packets. A memory prefetch mode is supported in which the MU fetches a message into L2, but does not send it. On the reception side, it receives network packets, and writes them into the appropriate location in memory, depending on the network protocol. On packet reception, the messaging unit 100 distinguishes between three different types of packets, and accordingly performs different operations. The types of packets supported are: memory FIFO packets, direct put packets, and remote get packets.

The messaging unit also supports local memory copy, where the MU copies an area in the local memory to another area in the memory. For memory-to-memory on-chip data transfer, a dedicated SRAM buffer, located in the network device, is used. Remote gets and the corresponding direct puts can be “paced” by software to reduce contention within the network. In this software-controlled paced mode, a remote get for a long message is broken up into multiple remote gets, each for a sub-message. The sub-message remote get is only allowed to enter the network if the number of packets belonging to the paced remote get active in the network is less than an allowed threshold. Software has to carefully control the pacing, otherwise deadlocks can occur.
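
The pacing scheme can be pictured with the following non-limiting sketch, in which the sub-message size, the threshold, and the inject_remote_get and packets_in_flight helpers are hypothetical stand-ins for the real MU interfaces:

    #include <stddef.h>

    #define SUB_MESSAGE_BYTES     (64 * 1024)   /* hypothetical sub-message size */
    #define MAX_PACKETS_IN_FLIGHT 32            /* hypothetical pacing threshold */

    /* Hypothetical helpers standing in for the real MU/network interfaces. */
    extern void   inject_remote_get(size_t offset, size_t length);
    extern size_t packets_in_flight(void);

    /* Pace a long remote get of 'total' bytes by injecting one sub-message at
     * a time, waiting whenever too many packets of this paced remote get are
     * still active in the network. */
    static void paced_remote_get(size_t total)
    {
        for (size_t offset = 0; offset < total; offset += SUB_MESSAGE_BYTES) {
            size_t length = (total - offset < SUB_MESSAGE_BYTES)
                              ? (total - offset) : SUB_MESSAGE_BYTES;
            /* Wait until this paced remote get has room in the network;
             * as noted above, software must pace carefully to avoid deadlock. */
            while (packets_in_flight() >= MAX_PACKETS_IN_FLIGHT)
                ;
            inject_remote_get(offset, length);
        }
    }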

The top level architecture of the Messaging Unit 100 interfacing with the Network interface Device (ND) 150 is shown in FIG. 2. The Messaging Unit 100 functional blocks involved with injection control, as shown in FIG. 2, include the following: injection control units 105 implementing logic for queuing and arbitrating the processors' requests to the control areas of the injection MU; reception control units 115 implementing logic for queuing and arbitrating the requests to the control areas of the reception MU; injection iMEs (injection Message Elements) 110 that read data from L2 cache or DDR memory and insert it in the network injection FIFOs 180, or in the local copy FIFO 185; and reception rMEs (reception Message Elements) 120 that read data from the network reception FIFOs 190 and insert them into L2. In one embodiment, there are 16 rMEs 120, one for each network reception FIFO. A DCR Unit 128 is provided that includes DCR registers for the MU 100.

The MU 100 further includes an interface to a cross-bar (XBAR) switch 55 (or SerDes switches in additional implementations). The MU operates at clock/2 (e.g., 800 MHz). The Network Device 150 operates at 500 MHz (e.g., 2 GB/s network). The MU 100 includes three (3) Xbar masters 125 to sustain network traffic and two (2) Xbar slaves 126 for programming. A DCR slave interface unit 127 is also provided.

The handover between the network device 150 and the MU 100 is performed via 2-port SRAMs for network injection/reception FIFOs. The MU 100 reads/writes one port using, for example, an 800 MHz clock, and the network reads/writes the second port with a 500 MHz clock. The only handovers are through the FIFOs and FIFO pointers (which are implemented using latches).

The injection side MU maintains injection FIFO pointers, as well as other hardware resources for putting messages into the 5-D torus network. Injection FIFOs are allocated in main memory, and each FIFO contains a number of message descriptors. Each descriptor is 64 bytes in length and includes a network header for routing, the base address and length of the message data to be sent, and other fields such as the type of packets, etc., for the reception MU at the remote node. A processor core prepares the message descriptors in injection FIFOs and then updates the corresponding injection FIFO pointers in the MU. The injection MU reads the descriptors and message data, packetizes the messages into network packets, and then injects them into the 5-D torus network.
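
A hypothetical 64-byte descriptor layout is sketched below purely to make the preceding description concrete; the field names and widths are illustrative, and only the presence of a network header, the payload base address and length, a packet-type field, and the overall 64-byte size follow from the description above:

    #include <stdint.h>

    /* Hypothetical layout of a 64-byte injection descriptor; field names and
     * widths are illustrative only. */
    struct mu_descriptor {
        uint8_t  network_header[32];  /* network header used for routing          */
        uint64_t payload_base_addr;   /* base address of the message data to send */
        uint32_t payload_length;      /* length of the message data in bytes      */
        uint8_t  packet_type;         /* memory FIFO, direct put, or remote get   */
        uint8_t  reserved[19];        /* padding up to the 64-byte descriptor     */
    };

    /* The description above fixes only the overall 64-byte descriptor size. */
    _Static_assert(sizeof(struct mu_descriptor) == 64,
                   "descriptor must be 64 bytes");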

Three types of network packets are supported: (1) Memory FIFO packets: the reception MU writes packets, including both network headers and data payload, into pre-allocated reception FIFOs in main memory. The MU maintains pointers to each reception FIFO. The received packets are further processed by the cores. (2) Put packets: the reception MU writes the data payload of the network packets into main memory directly, at addresses specified in network headers. The MU updates a message byte count after each packet is received. Processor cores are not involved in data movement, and only have to check that the expected numbers of bytes are received by reading message byte counts. (3) Get packets: the data payload contains descriptors for the remote nodes. The MU on a remote node receives each get packet into one of its injection FIFOs, then processes the descriptors and sends data back to the source node.

All MU resources are in memory mapped I/O address space and provide uniform access to all processor cores. In practice, the resources are likely grouped into smaller groups to give each core dedicated access. The preferred embodiment is to support 544 injection FIFOs, or 32/core, and 288 reception FIFOs, or 16/core. The reception byte counts for put messages are implemented in L2 using atomic counters. There is effectively an unlimited number of counters, subject to the limit of available memory for such atomic counters.

The MU interface is designed to deliver close to the peak 18 GB/s (send) + 18 GB/s (receive) 5-D torus nearest neighbor data bandwidth, when the message data is fully contained in the 32 MB L2. This is basically 1.8 GB/s + 1.8 GB/s maximum data payload bandwidth over 10 torus links. When the total message data size exceeds the 32 MB L2, the maximum network bandwidth is then limited by the sustainable external DDR memory bandwidth.

The torus network supports both point-to-point operations and collective communication operations. The collective communication operations supported are barrier, broadcast, reduce and allreduce. For example, a broadcast put descriptor will place the broadcast payload on all the nodes in the class route (a predetermined route set up for a group of nodes in the MPI communicator). Similarly, there are collective put reduce and broadcast operations. A remote get (with a reduce put payload) can be broadcast to all the nodes from which data will be reduced via the put descriptor.

FIG. 3 illustrates a set of components that support collective operations in a multi-node processing system. These components include a collective API 302, language adaptor 304, executor 306, and multisend interface 310.

Each application or programming language may implement a collective API 302 to invoke or call collective operation functions. A user application, for example one implemented in that programming language, may then make the appropriate function calls for the collective operations. Collective operations may then be performed via the language adaptor 304 using its internal components, such as an MPI communicator 312, in addition to the other components in the collective framework, such as the scheduler 314, executor 306, and multisend interface 310.

Language adaptor 304 interfaces the collective framework to a programming language. For example, a language adaptor such as one for a message passing interface (MPI) has a communicator component 312. Briefly, an MPI communicator is an object with a number of attributes and rules that govern its creation, use, and destruction. The communicator 312 determines the scope and the “communication universe” in which a point-to-point or collective operation is to operate. Each communicator 312 contains a group of valid participants, and the source and destination of a message is identified by process rank within that group.

Executor 306 may handle functionalities for specific optimizations such as pipelining, phase independence and multi-color routes. An executor may query a schedule for the list of tasks and execute the list of tasks returned by the scheduler 314. Typically, each collective operation is assigned one executor.

The scheduler 314 handles the functionality of collective operations and algorithms, and includes the set of steps in the collective algorithm that execute a collective operation. Scheduler 314 may split a collective operation into phases. For example, a broadcast can be done through a spanning tree schedule where, in each phase, a message is sent from one node to the next level of nodes in the spanning tree. In each phase, scheduler 314 lists the sources that will send a message to a processor and a list of tasks that need to be performed in that phase.
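
As a non-limiting illustration of a spanning-tree schedule split into phases, the sketch below prints, for each phase, which ranks forward the broadcast message; the function and its binomial schedule are illustrative only and are not taken from the framework itself:

    #include <stdio.h>

    /* In phase k of a binomial broadcast, every rank that already holds the
     * message (relative rank < 2^k) forwards it to the rank 2^k positions
     * away, so the message reaches all nranks ranks in ceil(log2(nranks))
     * phases. */
    static void print_binomial_broadcast_schedule(int nranks, int root)
    {
        for (int phase = 0; (1 << phase) < nranks; phase++) {
            printf("phase %d:\n", phase);
            for (int r = 0; r < nranks; r++) {
                int rel = (r - root + nranks) % nranks;  /* rank relative to root */
                if (rel < (1 << phase) && rel + (1 << phase) < nranks)
                    printf("  rank %d sends to rank %d\n",
                           r, (rel + (1 << phase) + root) % nranks);
            }
        }
    }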

Multisend interface 310 provides an interface to multisend 316, which is a message passing backbone of the collective framework. Multisend functionality allows sending many messages at the same time, each message or group of messages identified by a connection identifier. Multisend functionality also allows an application to multiplex data on this connection identifier.

As mentioned above, asynchronous one-sided collectives that do not involve participation of the intermediate and destination processors are critical for achieving good performance in a number of programming paradigms. For example, in an asynchronous one-sided broadcast, the root initiates the broadcast and all destination processors receive the broadcast message without any intermediate nodes forwarding the broadcast message to other nodes.

Embodiments of the present invention provide a method and system for a one-sided asynchronous reduce operation. Embodiments of the invention use the remote get collective to implement one-sided operations. The compute node kernel (CNK) operating system allows each MPI task to map the virtual to physical addresses of all the other tasks in the booted partition. Moreover, the remote-get and direct put descriptors take physical addresses of the input buffers.

Two specific example embodiments are described below. One embodiment, represented in FIG. 4, may be used when there is only one task per node; and a second embodiment, represented in FIG. 5, may be used when there is more than one task per node.

With reference to FIG. 4, at step 402, each node sets up a base address table with an entry for the base address of the buffer to be reduced. At step 404, the root of the collective injects a broadcast remote get descriptor whose payload is a put that reduces data back to the root node. The offset on each node must be the same from the address programmed in the base address table. This is common in PGAS runtimes, where the same array index must be reduced on all the nodes. At step 406, when the reduce operation completes, the root node has the sum of the contributions from all the nodes in the communicator.
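
The steps of FIG. 4 may be restated in the following non-limiting C-style sketch, in which bat_set_entry, inject_broadcast_remote_get, virt_to_phys and reduce_done are hypothetical stand-ins for the MU and CNK interfaces, which this disclosure does not define at the source level:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers; only the sequence of steps mirrors FIG. 4. */
    extern void     bat_set_entry(int slot, uint64_t phys_base);
    extern void     inject_broadcast_remote_get(const void *put_desc);
    extern uint64_t virt_to_phys(const void *va);
    extern int      reduce_done(void);

    /* One-sided asynchronous reduce, one task per node (FIG. 4 sketch). */
    void one_sided_reduce_single_task(double *buffer, size_t count,
                                      size_t offset, int is_root)
    {
        /* Step 402: every node publishes the base address of the buffer to be
         * reduced in its base address table; the offset from that base must
         * be the same on all nodes. */
        bat_set_entry(/*slot=*/0, virt_to_phys(buffer));

        if (is_root) {
            /* Step 404: the root injects a broadcast remote get whose payload
             * is a direct put that reduces (sums) the data back to the root. */
            struct { size_t offset, bytes; } put_desc =
                { offset, count * sizeof(double) };
            inject_broadcast_remote_get(&put_desc);

            /* Step 406: when the reduce completes, the root holds the sum of
             * the contributions from all nodes in the communicator. */
            while (!reduce_done())
                ;  /* poll a completion/byte counter */
        }
    }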

In the procedure illustrated in FIG. 5, at step 502, each of the n tasks sets up a base address table with an entry for the base address of the buffer to be reduced. At step 504, the root of the collective injects a broadcast remote get descriptor whose payload is a put that reduces data back to the root node for task 0 on each node of the communicator. The offset on each node must be the same from the address programmed in the base address table. The root then injects a collective remote get for task 1, and the process is repeated for all n tasks. As the remote gets are broadcast in a specific order, the reduce results will also complete in that order. At step 506, after the n remote gets have completed, the root node can locally sum the n results and compute the final reduce across all the n tasks on all the nodes.
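
Similarly, the steps of FIG. 5 may be restated in the following non-limiting sketch, again using hypothetical helper names; the root injects one broadcast remote get per task and then sums the n per-task results locally:

    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical helpers; only the step ordering mirrors FIG. 5. */
    extern void     bat_set_entry(int slot, uint64_t phys_base);
    extern void     inject_broadcast_remote_get_for_task(int task, size_t offset,
                                                         size_t bytes);
    extern uint64_t virt_to_phys(const void *va);
    extern void     wait_reduce_done(int task);

    /* One-sided asynchronous reduce with n tasks per node (FIG. 5 sketch). */
    void one_sided_reduce_multi_task(double *buffer, size_t count, size_t offset,
                                     int n_tasks, int is_root,
                                     double *per_task_result, /* n_tasks*count, root */
                                     double *final_result)    /* count, root only    */
    {
        /* Step 502: every task publishes the base address of its buffer in
         * its base address table (same offset on all nodes). */
        bat_set_entry(/*slot=*/0, virt_to_phys(buffer));

        if (!is_root)
            return;

        /* Step 504: one broadcast remote get per task, each carrying a put
         * payload that reduces that task's data back to the root; since the
         * remote gets are broadcast in order, they complete in order. */
        for (int task = 0; task < n_tasks; task++)
            inject_broadcast_remote_get_for_task(task, offset,
                                                 count * sizeof(double));
        for (int task = 0; task < n_tasks; task++)
            wait_reduce_done(task);

        /* Step 506: local sum of the n per-task results gives the final
         * reduce across all tasks on all nodes. */
        for (size_t i = 0; i < count; i++) {
            final_result[i] = 0.0;
            for (int task = 0; task < n_tasks; task++)
                final_result[i] += per_task_result[(size_t)task * count + i];
        }
    }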

While it is apparent that the invention herein disclosed is well calculated to fulfill the objects discussed above, it will be appreciated that numerous modifications and embodiments may be devised by those skilled in the art, and it is intended that the appended claims cover all such modifications and embodiments as fall within the true spirit and scope of the present invention.

1. A method of implementing an asynchronous collective operation in a multi-node data processing system, the method comprising: sending data to a plurality of nodes in the data processing system; broadcasting a remote get to said plurality of nodes; and using said remote get to implement asynchronous collective operations on said data by the plurality of nodes.
2. The method according to claim 1, wherein each of said plurality of nodes sets up a base address table with an entry for the source memory buffer associated with said each node's contribution to the global sum.
3. The method according to claim 1, wherein each of said plurality of nodes sets up a base address table with an entry for the address of the destination memory buffer on the target to which the summed output is copied.
4. The method according to claim 1, where all the nodes have the same physical address for the source buffer and this address is placed once in the remote get descriptor payload.
5. The method according to claim 1, where each compute node has N processes and the root node initiates N remote get operations, one for each process, to complete the one-sided reduce operation and then does a local sum to complete the one-sided reduce.
6. The method according to claim 1, wherein said broadcasting includes sending a defined communication to said plurality of destination nodes, and injecting a remote get descriptor into said defined communication, and wherein said remote get descriptor includes a put that reduces data back to a root node during said asynchronous collective operations.
7. The method according to claim 1, wherein each of said plurality of nodes performs only one task in said asynchronous collective operations.
8. The method according to claim 1, wherein each of the plurality of destination nodes performs a plurality of tasks in said collective operations, and wherein each of the tasks of each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each task.
9. The method according to claim 8, wherein said broadcasting includes sending a communication to said plurality of destination nodes, and injecting a remote get descriptor into said communication; said descriptor includes a put that reduces data back to a root node; and said put reduces data back to the root node from a first of the tasks performed by each of the nodes during said collective operations.
10. The method according to claim 7, wherein said broadcasting includes sending a put to each of the tasks of each of the nodes to reduce data from said each task to a root node.
11. A system for implementing an asynchronous collective operation in a multi-node data processing system, the system comprising one or more processing nodes of the data processing system configured for: sending data to a plurality of nodes in the data processing system; broadcasting a remote get to said plurality of nodes; and using said remote get to implement asynchronous collective operations on said data by the plurality of nodes.
12. The system according to claim 11, wherein the broadcasting includes using a mechanism for said Messaging Unit to initiate a one-sided allreduce operation by injecting a remote get descriptor that is broadcast to all the compute nodes and initiates a global-sum operation back to the initiating node or an arbitrary target node.
13. The system according to claim 11, wherein each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each node.
14. The system according to claim 11, wherein each of the plurality of destination nodes performs a plurality of tasks in said collective operations, and wherein each of the tasks of each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each task.
15. The system according to claim 11, wherein said broadcasting includes sending a put to each of the tasks of each of the nodes to reduce data from said each task to a root node.
16. An article of manufacture comprising: at least one tangible computer readable medium having computer readable program code logic to execute machine instructions in one or more processing units for implementing an asynchronous collective operation in a multi-node data processing system, the method comprising: sending data to a plurality of nodes in the data processing system; broadcasting a remote get to said plurality of nodes; and using said remote get to implement asynchronous collective operations on said data by the plurality of nodes.
17. The article of manufacture according to claim 16, wherein each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each node.
18. The article of manufacture according to claim 17, wherein each of said plurality of nodes performs only one task in said asynchronous collective operations.
19. The article of manufacture according to claim 17, wherein each of the plurality of destination nodes performs a plurality of tasks in said collective operations, and wherein each of the tasks of each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each task.
20. The article of manufacture according to claim 19, wherein said broadcasting includes sending a put to each of the tasks of each of the nodes to reduce data from said each task to a root node.
21. A method of implementing an asynchronous collective operation in a multi-node data processing system, the method comprising: sending data to a plurality of nodes in the data processing system; broadcasting a remote get to said plurality of nodes; and using said remote get to implement asynchronous collective operation on said data by the plurality of nodes; and wherein said broadcasting includes sending a defined communication to said plurality of destination nodes, and injecting a remote get descriptor into said defined communication, said remote get descriptor including a put that reduces data back to a root node during said asynchronous collective operation.
22. The method according to claim 21, wherein each of said plurality of nodes performs only one task in said asynchronous collective operations, and each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each node.
23. The method according to claim 21, wherein each of the plurality of destination nodes performs a plurality of tasks in said collective operations, and wherein each of the tasks of each of said plurality of nodes sets up a base address table with an entry for a base address of a memory buffer associated with said each task.
24. The method according to claim 23, wherein said broadcasting includes sending a put to each of the tasks of each of the nodes to reduce data from said each task to the root node.
25. The method according to claim 21, wherein said broadcasting includes said root node injecting said remote get descriptor into said defined communication.